Dataset EDA

Author

JaeHo Bahng

Published

May 19, 2024


Import module / Set options and theme
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import xml.etree.ElementTree as ET
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import ttest_rel, zscore
from statsmodels.stats.weightstats import ttest_ind
import pingouin as pg
import warnings
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 10)
Import cleaned data
df = pd.read_csv('../data/cleaned_df.csv')
df['Location_ID'] = df.groupby(['long', 'lat']).ngroup() + 1

Manipulate Dataset

Average Scenarios

The Average Scenarios dataset averages all numerical columns across the scenarios, producing one row for each location, year, and RCP. This dataset is used for EDA and for visualizing overall trends.

Clean data (Average Scenarios)
group_list = ['Park', 'long', 'lat', 'veg', 'year', 'TimePeriod', 'RCP','treecanopy', 'Ann_Herb', 'Bare', 'Herb', 'Litter', 'Shrub', 'El', 'Sa','Cl', 'RF', 'Slope', 'E', 'S']
veg_location = df.drop(labels='scenario',axis=1).groupby(group_list).mean().reset_index()
# veg_location['T_Annual'] = (veg_location['T_Annual'] - veg_location['T_Annual'].min()) / (veg_location['T_Annual'].max() - veg_location['T_Annual'].min())

# Convert RCP to numeric where possible; labels such as 'historical' become NaN
numeric_series = pd.to_numeric(veg_location['RCP'], errors='coerce')

# Restore the original label wherever the conversion produced NaN
veg_location['RCP'] = numeric_series.fillna(veg_location['RCP'])

four = veg_location[veg_location['RCP'].isin([4.5])]
eight = veg_location[veg_location['RCP'].isin([8.5])]

# Copy the historical rows under each RCP so both RCP series share the same history.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
hist = veg_location[veg_location['RCP'].isin(['historical'])]
four_h = hist.assign(RCP=4.5)
eight_h = hist.assign(RCP=8.5)

df_con = pd.concat([four_h, four, eight_h, eight], ignore_index=True)
df_con['Location_ID'] = df_con.groupby(['long', 'lat']).ngroup() + 1

df_con.head(5)
Park long lat veg year TimePeriod RCP treecanopy Ann_Herb Bare Herb Litter Shrub El Sa Cl RF Slope E S T_P_Corr DrySoilDays_Winter_top50 DrySoilDays_Spring_top50 DrySoilDays_Summer_top50 DrySoilDays_Fall_top50 DrySoilDays_Winter_whole DrySoilDays_Spring_whole DrySoilDays_Summer_whole DrySoilDays_Fall_whole Evap_Winter Evap_Spring Evap_Summer Evap_Fall ExtremeShortTermDryStress_Winter_top50 ExtremeShortTermDryStress_Spring_top50 ExtremeShortTermDryStress_Summer_top50 ExtremeShortTermDryStress_Fall_top50 ExtremeShortTermDryStress_Winter_whole ExtremeShortTermDryStress_Spring_whole ExtremeShortTermDryStress_Summer_whole ExtremeShortTermDryStress_Fall_whole FrostDays_Winter FrostDays_Spring FrostDays_Summer FrostDays_Fall NonDrySWA_Winter_top50 NonDrySWA_Spring_top50 NonDrySWA_Summer_top50 NonDrySWA_Fall_top50 NonDrySWA_Winter_whole NonDrySWA_Spring_whole NonDrySWA_Summer_whole NonDrySWA_Fall_whole PET_Winter PET_Spring PET_Summer PET_Fall PPT_Winter PPT_Spring PPT_Summer PPT_Fall SemiDryDuration_Annual_top50 SemiDryDuration_Annual_whole SWA_Winter_top50 SWA_Spring_top50 SWA_Summer_top50 SWA_Fall_top50 SWA_Winter_whole SWA_Spring_whole SWA_Summer_whole SWA_Fall_whole T_Winter T_Spring T_Summer T_Fall Tmax_Winter Tmax_Spring Tmax_Summer Tmax_Fall Tmin_Winter Tmin_Spring Tmin_Summer Tmin_Fall Transp_Winter Transp_Spring Transp_Summer Transp_Fall VWC_Winter_top50 VWC_Spring_top50 VWC_Summer_top50 VWC_Fall_top50 VWC_Winter_whole VWC_Spring_whole VWC_Summer_whole VWC_Fall_whole WetSoilDays_Winter_top50 WetSoilDays_Spring_top50 WetSoilDays_Summer_top50 WetSoilDays_Fall_top50 WetSoilDays_Winter_whole WetSoilDays_Spring_whole WetSoilDays_Summer_whole WetSoilDays_Fall_whole PPT_Annual T_Annual RL Location_ID
0 NABR -110.0472 37.60413 Shrubland 1980 Hist 4.5 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 -0.6636760860 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.7140658366 6.3995308949 1.5598074021 3.3632779979 NaN 24.34 36.16 29.52 NaN 24.34 36.16 29.52 75.0 34.0 0.0 26.0 3.4668806371 2.6546632530 0.0321140671 0.4880867481 3.4668806371 2.6546632530 0.0321140671 0.4880867481 7.7811633032 31.1394527955 48.0177480655 21.9156825756 13.79 8.71 2.69 6.37 36.5000000000 36.5000000000 3.4668806371 2.6546632530 0.0321140671 0.4880867481 3.4668806371 2.6546632530 0.0321140671 0.4880867481 0.96483520 8.767935 23.15924 11.962090 14.15 28.75 37.05 31.15 -12.45 -7.35 5.55 -10.25 0.2370806 5.296833 1.067496 1.9667860 0.1134468701 0.0968307001 0.0418759016 0.0522975530 0.1134468701 0.0968307001 0.0418759016 0.0522975530 91.0 77.0 5.0 47.0 91.0 77.0 5.0 47.0 31.56 11.21352505 54.57202074 1
1 NABR -110.0472 37.60413 Shrubland 1981 Hist 4.5 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.3478010620 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.1815202084 5.9723378265 5.0428776741 4.6374034668 13.92 26.53 36.08 NaN 13.92 26.53 36.08 NaN 79.0 26.0 0.0 13.0 0.3461917264 0.8982752558 0.0336629893 2.5013360811 0.3461917264 0.8982752558 0.0336629893 2.5013360811 8.1229049607 32.3882557036 48.1772426406 21.7575735702 2.25 9.81 9.39 11.75 13.2500000000 13.2500000000 0.3461917264 0.8982752558 0.0336629893 2.5013360811 0.3461917264 0.8982752558 0.0336629893 2.5013360811 3.33444400 10.548370 23.27065 11.581320 17.05 28.15 37.55 29.75 -9.35 -5.55 1.25 -7.25 0.2930753 3.506108 3.916328 2.7875470 0.0493818430 0.0607271763 0.0426386771 0.0936706801 0.0493818430 0.0607271763 0.0426386771 0.0936706801 48.0 60.0 13.0 85.0 48.0 60.0 13.0 85.0 33.20 12.18369600 54.57202074 1
2 NABR -110.0472 37.60413 Shrubland 1982 Hist 4.5 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.3260300992 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.2589947135 4.7173273934 4.5276363327 4.2477717540 NaN 26.19 34.99 22.06 NaN 26.19 34.99 22.06 83.0 21.0 0.0 30.0 3.2599844936 1.5994982052 0.1993822366 1.2432253150 3.2599844936 1.5994982052 0.1993822366 1.2432253150 7.3379526955 31.4894498184 47.1800768757 21.0684231651 4.12 5.10 9.50 9.83 17.2857142857 17.2857142857 3.2599844936 1.5994982052 0.1993822366 1.2432253150 3.2599844936 1.5994982052 0.1993822366 1.2432253150 -0.01555556 9.472283 22.05707 9.869231 14.35 28.45 36.65 31.85 -16.55 -7.25 5.65 -6.25 0.2453347 3.105047 3.523923 2.8900990 0.1092341982 0.0748166564 0.0456102615 0.0677891794 0.1092341982 0.0748166564 0.0456102615 0.0677891794 90.0 62.0 19.0 73.0 90.0 62.0 19.0 73.0 28.55 10.34575711 54.57202074 1
3 NABR -110.0472 37.60413 Shrubland 1983 Hist 4.5 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.0388273872 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.7419915365 6.2671578978 5.1695757094 3.7751048188 NaN 28.56 33.69 31.02 NaN 28.56 33.69 31.02 85.0 32.0 0.0 19.0 3.8064480379 2.9456592119 0.0960442305 1.5835966242 3.8064480379 2.9456592119 0.0960442305 1.5835966242 7.4798456947 30.3128312703 46.5762368398 21.8471460016 7.09 10.80 10.22 10.40 16.7142857143 16.7142857143 3.8064480379 2.9456592119 0.0960442305 1.5835966242 3.8064480379 2.9456592119 0.0960442305 1.5835966242 0.40944440 8.020652 21.32826 11.325820 13.35 30.65 34.55 33.15 -15.05 -7.25 3.85 -8.95 0.2252735 4.962824 5.006576 1.1952350 0.1204177901 0.1025422325 0.0441405046 0.0748017843 0.1204177901 0.1025422325 0.0441405046 0.0748017843 90.0 74.0 15.0 69.0 90.0 74.0 15.0 69.0 38.51 10.27104410 54.57202074 1
4 NABR -110.0472 37.60413 Shrubland 1984 Hist 4.5 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.2166602692 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.6272686835 5.0078604793 5.2303324404 4.0803373430 NaN 30.95 34.01 29.52 NaN 30.95 34.01 29.52 91.0 35.0 0.0 30.0 3.7945975224 1.7555326024 0.0452782946 1.3792575946 3.7945975224 1.7555326024 0.0452782946 1.3792575946 7.1730101555 31.9972417196 47.0386757592 21.0183982059 4.77 4.32 9.49 8.17 16.5000000000 16.5000000000 3.7945975224 1.7555326024 0.0452782946 1.3792575946 3.7945975224 1.7555326024 0.0452782946 1.3792575946 -1.04725300 9.853804 21.95978 10.034070 10.25 32.75 35.35 31.35 -18.45 -8.45 2.95 -12.45 0.1226868 3.120243 4.269040 0.9273169 0.1202091711 0.0778415354 0.0431793330 0.0703661709 0.1202091711 0.0778415354 0.0431793330 0.0703661709 91.0 65.0 16.0 62.0 91.0 65.0 16.0 62.0 26.75 10.20010025 54.57202074 1
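
The averaging step and the duplication of the historical rows can be illustrated on a toy frame. This is a minimal sketch rather than the project's code: the values are made up, only a handful of the dataset's columns are included, and the RCP handling is simplified by comparing against the string 'historical' before converting the remaining labels to floats.

```python
import pandas as pd

# Made-up rows for a single location; column names follow the dataset
toy = pd.DataFrame({
    "long": [-110.0] * 5,
    "lat": [37.6] * 5,
    "year": [2020, 2021, 2021, 2021, 2021],
    "RCP": ["historical", "4.5", "4.5", "8.5", "8.5"],
    "scenario": ["sc1", "sc22", "sc23", "sc42", "sc43"],
    "T_Annual": [11.0, 12.0, 13.0, 13.0, 15.0],
})

# Collapse the scenarios: one row per location, year, and RCP
averaged = (
    toy.drop(columns="scenario")
       .groupby(["long", "lat", "year", "RCP"], as_index=False)
       .mean()
)

# Split historical and future rows, then copy the history under each RCP
is_hist = averaged["RCP"] == "historical"
hist = averaged[is_hist]
future = averaged[~is_hist].assign(RCP=lambda d: d["RCP"].astype(float))
combined = pd.concat(
    [hist.assign(RCP=4.5), hist.assign(RCP=8.5), future],
    ignore_index=True,
)

print(combined.sort_values(["RCP", "year"]))
```

Each RCP now has its own copy of the 2020 row, so a line plot colored by RCP draws two lines that share the same history and diverge only in the predicted years.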

All Scenarios

Code
# Convert RCP to numeric where possible; labels such as 'historical' become NaN
numeric_series = pd.to_numeric(df['RCP'], errors='coerce')

# Restore the original label wherever the conversion produced NaN
df['RCP'] = numeric_series.fillna(df['RCP'])

four = df[df['RCP'].isin([4.5])]
eight = df[df['RCP'].isin([8.5])]

# Copy the historical rows under each RCP so both RCP series share the same history.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
hist = df[df['RCP'].isin(['historical'])]
four_h = hist.assign(RCP=4.5)
eight_h = hist.assign(RCP=8.5)

df_orig = pd.concat([four_h, four, eight_h, eight], ignore_index=True)
df_orig['Location_ID'] = df_orig.groupby(['long', 'lat']).ngroup() + 1

df_orig.head(5)
Park long lat veg year TimePeriod RCP scenario treecanopy Ann_Herb Bare Herb Litter Shrub El Sa Cl RF Slope E S T_P_Corr DrySoilDays_Winter_top50 DrySoilDays_Spring_top50 DrySoilDays_Summer_top50 DrySoilDays_Fall_top50 DrySoilDays_Winter_whole DrySoilDays_Spring_whole DrySoilDays_Summer_whole DrySoilDays_Fall_whole Evap_Winter Evap_Spring Evap_Summer Evap_Fall ExtremeShortTermDryStress_Winter_top50 ExtremeShortTermDryStress_Spring_top50 ExtremeShortTermDryStress_Summer_top50 ExtremeShortTermDryStress_Fall_top50 ExtremeShortTermDryStress_Winter_whole ExtremeShortTermDryStress_Spring_whole ExtremeShortTermDryStress_Summer_whole ExtremeShortTermDryStress_Fall_whole FrostDays_Winter FrostDays_Spring FrostDays_Summer FrostDays_Fall NonDrySWA_Winter_top50 NonDrySWA_Spring_top50 NonDrySWA_Summer_top50 NonDrySWA_Fall_top50 NonDrySWA_Winter_whole NonDrySWA_Spring_whole NonDrySWA_Summer_whole NonDrySWA_Fall_whole PET_Winter PET_Spring PET_Summer PET_Fall PPT_Winter PPT_Spring PPT_Summer PPT_Fall SemiDryDuration_Annual_top50 SemiDryDuration_Annual_whole SWA_Winter_top50 SWA_Spring_top50 SWA_Summer_top50 SWA_Fall_top50 SWA_Winter_whole SWA_Spring_whole SWA_Summer_whole SWA_Fall_whole T_Winter T_Spring T_Summer T_Fall Tmax_Winter Tmax_Spring Tmax_Summer Tmax_Fall Tmin_Winter Tmin_Spring Tmin_Summer Tmin_Fall Transp_Winter Transp_Spring Transp_Summer Transp_Fall VWC_Winter_top50 VWC_Spring_top50 VWC_Summer_top50 VWC_Fall_top50 VWC_Winter_whole VWC_Spring_whole VWC_Summer_whole VWC_Fall_whole WetSoilDays_Winter_top50 WetSoilDays_Spring_top50 WetSoilDays_Summer_top50 WetSoilDays_Fall_top50 WetSoilDays_Winter_whole WetSoilDays_Spring_whole WetSoilDays_Summer_whole WetSoilDays_Fall_whole PPT_Annual T_Annual RL Location_ID
0 NABR -110.0472 37.60413 Shrubland 1980 Hist 4.5 sc1 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 -0.6636760860 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.7140658366 6.3995308949 1.5598074021 3.3632779979 NaN 24.34 36.16 29.52 NaN 24.34 36.16 29.52 75.0 34.0 0.0 26.0 3.4668806371 2.6546632530 0.0321140671 0.4880867481 3.4668806371 2.6546632530 0.0321140671 0.4880867481 7.7811633032 31.1394527955 48.0177480655 21.9156825756 13.79 8.71 2.69 6.37 36.5000000000 36.5000000000 3.4668806371 2.6546632530 0.0321140671 0.4880867481 3.4668806371 2.6546632530 0.0321140671 0.4880867481 0.96483520 8.767935 23.15924 11.962090 14.15 28.75 37.05 31.15 -12.45 -7.35 5.55 -10.25 0.2370806 5.296833 1.067496 1.9667860 0.1134468701 0.0968307001 0.0418759016 0.0522975530 0.1134468701 0.0968307001 0.0418759016 0.0522975530 91.0 77.0 5.0 47.0 91.0 77.0 5.0 47.0 31.56 11.21352505 54.57202074 1
1 NABR -110.0472 37.60413 Shrubland 1981 Hist 4.5 sc1 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.3478010620 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.1815202084 5.9723378265 5.0428776741 4.6374034668 13.92 26.53 36.08 NaN 13.92 26.53 36.08 NaN 79.0 26.0 0.0 13.0 0.3461917264 0.8982752558 0.0336629893 2.5013360811 0.3461917264 0.8982752558 0.0336629893 2.5013360811 8.1229049607 32.3882557036 48.1772426406 21.7575735702 2.25 9.81 9.39 11.75 13.2500000000 13.2500000000 0.3461917264 0.8982752558 0.0336629893 2.5013360811 0.3461917264 0.8982752558 0.0336629893 2.5013360811 3.33444400 10.548370 23.27065 11.581320 17.05 28.15 37.55 29.75 -9.35 -5.55 1.25 -7.25 0.2930753 3.506108 3.916328 2.7875470 0.0493818430 0.0607271763 0.0426386771 0.0936706801 0.0493818430 0.0607271763 0.0426386771 0.0936706801 48.0 60.0 13.0 85.0 48.0 60.0 13.0 85.0 33.20 12.18369600 54.57202074 1
2 NABR -110.0472 37.60413 Shrubland 1982 Hist 4.5 sc1 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.3260300992 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.2589947135 4.7173273934 4.5276363327 4.2477717540 NaN 26.19 34.99 22.06 NaN 26.19 34.99 22.06 83.0 21.0 0.0 30.0 3.2599844936 1.5994982052 0.1993822366 1.2432253150 3.2599844936 1.5994982052 0.1993822366 1.2432253150 7.3379526955 31.4894498184 47.1800768757 21.0684231651 4.12 5.10 9.50 9.83 17.2857142857 17.2857142857 3.2599844936 1.5994982052 0.1993822366 1.2432253150 3.2599844936 1.5994982052 0.1993822366 1.2432253150 -0.01555556 9.472283 22.05707 9.869231 14.35 28.45 36.65 31.85 -16.55 -7.25 5.65 -6.25 0.2453347 3.105047 3.523923 2.8900990 0.1092341982 0.0748166564 0.0456102615 0.0677891794 0.1092341982 0.0748166564 0.0456102615 0.0677891794 90.0 62.0 19.0 73.0 90.0 62.0 19.0 73.0 28.55 10.34575711 54.57202074 1
3 NABR -110.0472 37.60413 Shrubland 1983 Hist 4.5 sc1 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.0388273872 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.7419915365 6.2671578978 5.1695757094 3.7751048188 NaN 28.56 33.69 31.02 NaN 28.56 33.69 31.02 85.0 32.0 0.0 19.0 3.8064480379 2.9456592119 0.0960442305 1.5835966242 3.8064480379 2.9456592119 0.0960442305 1.5835966242 7.4798456947 30.3128312703 46.5762368398 21.8471460016 7.09 10.80 10.22 10.40 16.7142857143 16.7142857143 3.8064480379 2.9456592119 0.0960442305 1.5835966242 3.8064480379 2.9456592119 0.0960442305 1.5835966242 0.40944440 8.020652 21.32826 11.325820 13.35 30.65 34.55 33.15 -15.05 -7.25 3.85 -8.95 0.2252735 4.962824 5.006576 1.1952350 0.1204177901 0.1025422325 0.0441405046 0.0748017843 0.1204177901 0.1025422325 0.0441405046 0.0748017843 90.0 74.0 15.0 69.0 90.0 74.0 15.0 69.0 38.51 10.27104410 54.57202074 1
4 NABR -110.0472 37.60413 Shrubland 1984 Hist 4.5 sc1 0 0 84 5 11 7 1764.955 77.03307 6.082058 2.285707 1949.283 -8753.784 4834.13 0.2166602692 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.6272686835 5.0078604793 5.2303324404 4.0803373430 NaN 30.95 34.01 29.52 NaN 30.95 34.01 29.52 91.0 35.0 0.0 30.0 3.7945975224 1.7555326024 0.0452782946 1.3792575946 3.7945975224 1.7555326024 0.0452782946 1.3792575946 7.1730101555 31.9972417196 47.0386757592 21.0183982059 4.77 4.32 9.49 8.17 16.5000000000 16.5000000000 3.7945975224 1.7555326024 0.0452782946 1.3792575946 3.7945975224 1.7555326024 0.0452782946 1.3792575946 -1.04725300 9.853804 21.95978 10.034070 10.25 32.75 35.35 31.35 -18.45 -8.45 2.95 -12.45 0.1226868 3.120243 4.269040 0.9273169 0.1202091711 0.0778415354 0.0431793330 0.0703661709 0.1202091711 0.0778415354 0.0431793330 0.0703661709 91.0 65.0 16.0 62.0 91.0 65.0 16.0 62.0 26.75 10.20010025 54.57202074 1

Data Exploration

Basic Statistics

  • 79 years of prediction (2021~2099)
  • 40 scenarios (sc22~sc61)
  • 2 RCP scenarios (4.5, 8.5)
  • 113 locations

Explanation

  • The data were collected at 113 locations within Natural Bridges National Monument (the number of unique latitude/longitude combinations).
  • The dataset contains 41 years of historical data and 79 years of predictions. Since there can be only one scenario for past data, all historical data is labeled ‘sc1’, or scenario one.
  • For the predicted years (2021 to 2099), there are two RCP scenarios, which change only the RCP variable, and 40 scenarios, which simulate 86 other variables.

For each combination of scenarios, a prediction is made at every location point for a range of columns, including annual and seasonal precipitation, seasonal dry soil days, seasonal evaporation, seasonal extreme short-term dry stress, and soil water availability, leading to a final prediction of annual and seasonal temperatures.
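
The counts listed above can be checked directly from the data. The sketch below assumes a frame shaped like df_orig, with the columns long, lat, year, TimePeriod, RCP, and scenario. The helper name summarize and the toy rows are hypothetical, and 'NT' is only a placeholder for a non-historical TimePeriod label.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Count the basic dimensions of a frame shaped like the project data."""
    is_hist = frame["TimePeriod"] == "Hist"
    future = frame[~is_hist]
    return {
        "locations": len(frame[["long", "lat"]].drop_duplicates()),
        "historical_years": frame.loc[is_hist, "year"].nunique(),
        "predicted_years": future["year"].nunique(),
        "future_scenarios": future["scenario"].nunique(),
        "rcps": future["RCP"].nunique(),
    }

# Tiny made-up frame: two locations, one historical year, one predicted year
toy = pd.DataFrame({
    "long": [-110.0, -110.0, -110.0, -110.1],
    "lat": [37.6, 37.6, 37.6, 37.7],
    "year": [2020, 2021, 2021, 2021],
    "TimePeriod": ["Hist", "NT", "NT", "NT"],  # 'NT' is a placeholder label
    "RCP": [4.5, 4.5, 8.5, 4.5],
    "scenario": ["sc1", "sc22", "sc42", "sc22"],
})

counts = summarize(toy)
print(counts)
```

If the statements above hold, calling summarize(df_orig) should report 113 locations, 41 historical years, 79 predicted years, 40 future scenarios, and 2 RCPs.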

What is RCP?

Representative Concentration Pathways (RCPs): a group of scenarios that describe how CO2 emissions are predicted to develop, as in the image below.

  • The dataset consists of two RCP scenarios: 4.5 and 8.5.

Source: Representative Concentration Pathway. (2024, May 2). In Wikipedia. https://en.wikipedia.org/wiki/Representative_Concentration_Pathway

Location

Where is this data located, and what does it look like?

The data points were sampled at Natural Bridges National Monument in Utah. The plots below show two different location aspects of the dataset. The first plot shows the average annual temperature at each location point in the year 2099. Since the predicted temperature increases over time, the last year of the dataset was chosen for a more dramatic comparison.

The second plot is a scatter plot of vegetation type by location. Comparing the two plots suggests that annual temperature is not strongly related to vegetation, but is related to location (latitude and longitude). We will return to this in the following visualizations.

Map Visualizations
# Mean annual temperature per location in 2099 (named temp_map to avoid shadowing the built-in map)
temp_map = df_con[df_con['year']==2099].groupby(['long','lat'])['T_Annual'].mean().reset_index()

fig = px.scatter_mapbox(temp_map, lat="lat", lon="long", color="T_Annual", size="T_Annual",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=8, zoom=11,
                  mapbox_style="open-street-map")

fig.update_layout(
    title={
        'text': "<b>Average Temperature (2099) </b>",
        'y': 0.97,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin={"r": 0, "t": 40, "l": 0, "b": 0}
    )

fig.show()

# One row per location and vegetation type (named veg_map to avoid shadowing the built-in map)
veg_map = df_con[df_con['year']==2099].groupby(['long','lat','veg']).size().reset_index()

# Create the scatter mapbox; 'veg' is categorical, so Plotly assigns discrete colors
# and a continuous color scale would have no effect here
fig = px.scatter_mapbox(veg_map, lat="lat", lon="long", color="veg",
                        zoom=11, mapbox_style="open-street-map")

# Update the layout with the title, legend title, and legend position
fig.update_layout(
    title={
        'text': "<b>Vegetation Location</b>",
        'y': 0.97,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    legend={
        'title': {'text': 'Vegetation Level'},  # Change this to your desired legend title
        'x': 1,  # Place the legend at the right edge of the plot
        'y': 0.8,  # Place the legend anchor at 80% of the plot height
        'xanchor': 'left',  # Anchor the legend's x position to its left side
        'yanchor': 'middle'  # Anchor the legend's y position to its middle
    },
    margin={"r": 0, "t": 40, "l": 0, "b": 0}
)
fig.update_traces(marker=dict(size=10))  # Set the desired fixed marker size

# Show the figure
fig.show()

Temperature/Precipitation Trends

The following plots were drawn by averaging annual temperature and annual precipitation over all scenarios, locations, and RCPs for each year, to show the overall trend of the predictions in the dataset. Predictions start in 2021, which is marked by the dashed red line in each plot.

The predictions show a steady increase in temperature, while precipitation only fluctuates. This allows an educated guess that temperature is the more important variable for the RCP scenarios, which deal with CO2 emissions.

Temperature / Precipitation Predictions Overview
# Average annual temperature across all locations and both RCPs for each year
# (df_con already holds the scenario-averaged data)
filtered_data = df_con.groupby(['year'])['T_Annual'].mean().reset_index()

# Create the line plot
fig = px.line(
    data_frame=filtered_data,
    x='year',
    y='T_Annual',
    title='<b>Annual Temperature</b>',
    labels={'T_Annual': 'Annual Temperature'},
    line_shape='spline'
)

# Add a vertical line at year 2021
fig.add_shape(
    dict(
        type='line',
        x0=2021,
        y0=filtered_data['T_Annual'].min()/1.1,
        x1=2021,
        y1=filtered_data['T_Annual'].max()*1.1,
        line=dict(
            color="Red",
            width=2,
            dash="dash",
        ),
    )
)

fig.add_annotation(
    dict(
        x=2021,  # Anchor the label at the first predicted year
        y=filtered_data['T_Annual'].max(),  # Place the label at the highest plotted value
        xref="x",
        yref="y",
        text="Prediction",
        showarrow=False,
        font=dict(
            size=12,
            color="Red"
        ),
        align="center",
        xanchor="left"
    )
)

fig.update_layout(title={'x':0.5})
# Show the plot
fig.show()


# Average annual precipitation across all locations and both RCPs for each year
# (df_con already holds the scenario-averaged data)
filtered_data = df_con.groupby(['year'])['PPT_Annual'].mean().reset_index()

# Create the line plot
fig = px.line(
    data_frame=filtered_data,
    x='year',
    y='PPT_Annual',
    title='<b>Annual Precipitation</b>',
    labels={'PPT_Annual': 'Annual Precipitation'},
    line_shape='spline'
)

# Add a vertical line at year 2021
fig.add_shape(
    dict(
        type='line',
        x0=2021,
        y0=filtered_data['PPT_Annual'].min()/1.1,
        x1=2021,
        y1=filtered_data['PPT_Annual'].max()*1.1,
        line=dict(
            color="Red",
            width=2,
            dash="dash",
        ),
    )
)

fig.add_annotation(
    dict(
        x=2021,  # Anchor the label at the first predicted year
        y=filtered_data['PPT_Annual'].max(),  # Place the label at the highest plotted value
        xref="x",
        yref="y",
        text="Prediction",
        showarrow=False,
        font=dict(
            size=12,
            color="Red"
        ),
        align="center",
        xanchor="left"
    )
)

fig.update_layout(title={'x':0.5})
# Show the plot
fig.show()

Perspectives to Consider

What are some aspects of the datasets we can slice and dice or drill down to compare and retrieve meaningful insights?

Most numerical features are generated by the scenario of the model that produced the future data, while some numerical features, such as S, E, Slope, RF, Cl, Sa, El, and treecanopy, are fixed for each unique location. Categorical variables are therefore the aspects of the dataset we can compare to create new insights.

Categorical Variables :

  • RCP
  • Vegetation
  • Scenario
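
The claim that some features are fixed per location can be checked by counting distinct values within each location. A minimal sketch, assuming a Location_ID column like the one created earlier; the helper name location_fixed_columns and the toy values are hypothetical.

```python
import pandas as pd

def location_fixed_columns(frame: pd.DataFrame, id_col: str = "Location_ID") -> list:
    """Return the columns that take at most one distinct value within every location."""
    per_location = frame.groupby(id_col).nunique()
    return [col for col in per_location.columns if (per_location[col] <= 1).all()]

# Made-up example: El and Slope never change within a location, T_Annual does
toy = pd.DataFrame({
    "Location_ID": [1, 1, 2, 2],
    "El": [1764.9, 1764.9, 1800.0, 1800.0],
    "Slope": [2.3, 2.3, 4.1, 4.1],
    "T_Annual": [11.2, 12.1, 10.9, 11.8],
})

fixed = location_fixed_columns(toy)
print(fixed)
```

Columns returned by this check carry no information beyond the location itself, which is why they are not useful for comparing scenarios.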

The following plots compare the predicted annual temperature for each category for the three categorical variables

Temperature RCP comparison
# Average annual temperature for each year and RCP
# (df_con already holds the scenario-averaged data)
filtered_data = df_con.groupby(['year','RCP'])['T_Annual'].mean().reset_index()

# Create the line plot
fig = px.line(
    data_frame=filtered_data,
    x='year',
    y='T_Annual',
    color='RCP',  # This will create one line for each unique value in 'RCP' and color them differently
    title='<b>Annual Temperature by RCP Type</b>',
    labels={'T_Annual': 'Annual Temperature'},
    line_shape='spline'
)
fig.update_layout(title={'x':0.5})

# Add a vertical line at year 2021
fig.add_shape(
    dict(
        type='line',
        x0=2021,
        y0=filtered_data['T_Annual'].min()/1.1,
        x1=2021,
        y1=filtered_data['T_Annual'].max()*1.1,
        line=dict(
            color="Red",
            width=2,
            dash="dash",
        ),
    )
)

fig.add_annotation(
    dict(
        x=2021,  # Anchor the label at the first predicted year
        y=filtered_data['T_Annual'].max(),  # Place the label at the highest plotted value
        xref="x",
        yref="y",
        text="Prediction",
        showarrow=False,
        font=dict(
            size=12,
            color="Red"
        ),
        align="center",
        xanchor="left"
    )
)

# Show the plot
fig.show()

Since RCP deals with CO2 emissions and the 8.5 scenario assumes higher emissions than the 4.5 scenario, annual temperature rises faster under RCP 8.5 than under RCP 4.5, although both increase.

Temperature comparison (Vegetation)
# Keep the RCP 4.5 rows (historical rows in df_con were already relabeled as 4.5 or 8.5),
# then average annual temperature for each year and vegetation type
filtered_data = df_con[df_con['RCP'].isin(['historical', 4.5])].groupby(['year','veg'])['T_Annual'].mean().reset_index()

# Create the line plot
fig = px.line(
    data_frame=filtered_data,
    x='year',
    y='T_Annual',
    color='veg',  # This will create lines for each unique value in 'veg' and color them differently
    title='<b>Annual Temperature by Vegetation Type</b>',
    labels={'T_Annual': 'Annual Temperature'}
)
fig.update_layout(title={'x':0.5})


# Add a vertical line at year 2021
fig.add_shape(
    dict(
        type='line',
        x0=2021,
        y0=filtered_data['T_Annual'].min()/1.1,
        x1=2021,
        y1=filtered_data['T_Annual'].max()*1.1,
        line=dict(
            color="Red",
            width=2,
            dash="dash",
        ),
    )
)

fig.add_annotation(
    dict(
        x=2021,  # Anchor the label at the first predicted year
        y=filtered_data['T_Annual'].max(),  # Place the label at the highest plotted value
        xref="x",
        yref="y",
        text="Prediction",
        showarrow=False,
        font=dict(
            size=12,
            color="Red"
        ),
        align="center",
        xanchor="left"
    )
)

# Show the plot
fig.show()

The vegetation types seem to follow exactly the same prediction pattern, with a fixed difference between them. This may mean that when predictions were calculated from the scenarios, the algorithm was modeled so that the means of the vegetation types were always a given distance apart. Because of this limitation, it is unnecessary to compare vegetation types with each other: we would always get the same difference.
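
Whether the gap between vegetation types really is constant can be checked numerically rather than by eye. The sketch below is one possible check under stated assumptions: it uses the dataset's year, veg, and T_Annual columns, the helper name veg_gap_spread is hypothetical, and the toy values (including the 'Forest' label) are made up.

```python
import pandas as pd

def veg_gap_spread(frame: pd.DataFrame, value: str = "T_Annual") -> pd.Series:
    """Standard deviation over years of each vegetation type's offset from the yearly mean."""
    # Rows are years, columns are vegetation types, cells are mean values
    yearly = frame.pivot_table(index="year", columns="veg", values=value, aggfunc="mean")
    # Offset of each vegetation type from the average of all types in that year
    gaps = yearly.sub(yearly.mean(axis=1), axis=0)
    return gaps.std()

# Made-up data in which 'Shrubland' is always exactly 1 degree warmer than 'Forest'
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2023, 2023],
    "veg": ["Shrubland", "Forest"] * 3,
    "T_Annual": [12.0, 11.0, 13.0, 12.0, 15.0, 14.0],
})

spread = veg_gap_spread(toy)
print(spread)
```

A spread close to zero for every vegetation type would confirm that the offsets are constant over time. It would not by itself explain why the model produces them.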

Temperature comparison (scenario)
# Keep only the predicted years by excluding the historical period ('Hist')
df_filtered = df_orig[df_orig['TimePeriod'] != 'Hist']

# Calculate the median of 'T_Annual' for each 'scenario'
medians = df_filtered.groupby('scenario')['T_Annual'].median().reset_index()

# Sort the median values
medians = medians.sort_values('T_Annual')

# Merge sorted median DataFrame back to the filtered DataFrame
df_sorted = pd.merge(medians['scenario'], df_filtered, on='scenario', how='left')

# Creating a boxplot with Plotly Express using the sorted DataFrame
fig = px.box(df_sorted, x='scenario', y='T_Annual', color='RCP')

# Rotating x-axis labels
fig.update_layout(
    xaxis_tickangle=-90,
    title={
        'text': "<b>Annual Temperature by Scenario</b>",
        'x':0.5,
        'xanchor': 'center'
    }
)
# Displaying the plot
fig.show()

Since we already know that RCP plays a big role in how the algorithm predicts temperature, we divide the scenarios into RCP 4.5 scenarios and RCP 8.5 scenarios to see whether there is a significant difference. The plot shows that RCP 4.5 corresponds to scenarios 22~41 and RCP 8.5 corresponds to scenarios 42~61. There are cases where a 4.5 scenario has higher temperatures than an 8.5 scenario, but since RCP acts as the first drill-down layer of the dataset, we can use the scenario as the second drill-down layer.
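
The mapping between scenarios and RCPs can also be read from the data instead of the plot. A minimal sketch, assuming a frame shaped like df_orig with TimePeriod, RCP, and scenario columns and scenario labels of the form 'sc22'. The helper name scenarios_by_rcp and the toy rows are hypothetical, and 'NT' is only a placeholder for a non-historical TimePeriod label.

```python
import pandas as pd

def scenarios_by_rcp(frame: pd.DataFrame) -> dict:
    """Map each RCP value to its scenario labels, sorted by scenario number."""
    future = frame[frame["TimePeriod"] != "Hist"]
    return {
        rcp: sorted(group["scenario"].unique(), key=lambda label: int(label[2:]))
        for rcp, group in future.groupby("RCP")
    }

# Made-up rows covering the first and last scenario of each RCP
toy = pd.DataFrame({
    "TimePeriod": ["Hist", "NT", "NT", "NT", "NT"],  # 'NT' is a placeholder label
    "RCP": [4.5, 4.5, 4.5, 8.5, 8.5],
    "scenario": ["sc1", "sc41", "sc22", "sc61", "sc42"],
})

mapping = scenarios_by_rcp(toy)
print(mapping)
```

On the full dataset, a result with sc22 to sc41 under 4.5 and sc42 to sc61 under 8.5 would match what the box plot suggests.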

Statistical Significance

Is there a significant difference between different scenarios?

Before we start analyzing the dataset, one final step is to test whether the differences between the scenarios we plan to compare are statistically significant.

The three comparisons we plan to make are as follows:

  1. RCP 8.5 (High) vs RCP 4.5 (Low)
  2. RCP 4.5: Scenario 37 (High) vs Scenario 40 (Low)
  3. RCP 8.5: Scenario 58 (High) vs Scenario 60 (Low)

T-test for RCP 4.5 and 8.5
# Creating data groups
data_before = df_orig[(df_orig['RCP'] == 8.5) & (df_orig['TimePeriod'] != 'Hist')]['T_Annual']
data_after = df_orig[(df_orig['RCP'] == 4.5) & (df_orig['TimePeriod'] != 'Hist')]['T_Annual']
 
# Conducting two-sample ttest
result = pg.ttest(data_before,
                  data_after,
                  correction=True)
 

# # Print the result
# print("t value : ",result['T'][0])
# print("95% Confidence Interval : ", result['CI95%'][0])
# print("p-value : ", result['p-val'][0])

T-test for RCP 4.5 and 8.5

Result Value
t-value 232.998
95% Confidence Interval [1.25 1.27]
p-Value 0.000
T-test for Scenario 40 vs 37
# Creating data groups
data_before = df_orig[df_orig['scenario'] == 'sc40']['T_Annual']
data_after = df_orig[df_orig['scenario'] == 'sc37']['T_Annual']
 
# Conducting two-sample ttest
result = pg.ttest(data_before,
                  data_after,
                  correction=True)
 
# # Print the result
# print("t value : ",result['T'][0])
# print("95% Confidence Interval : ", result['CI95%'][0])
# print("p-value : ", result['p-val'][0])

T-test for Scenario 40 vs 37

Result Value
t-value -157.977
95% Confidence Interval [-2.51 -2.45]
p-Value 0.000
T-test for Scenario 60 vs 58
# Creating data groups
data_before = df_orig[df_orig['scenario'] == 'sc60']['T_Annual']
data_after = df_orig[df_orig['scenario'] == 'sc58']['T_Annual']
 
# Conducting two-sample ttest
result = pg.ttest(data_before,
                  data_after,
                  correction=True)
 
# # Print the result
# print("t value : ",result['T'][0])
# print("95% Confidence Interval : ", result['CI95%'][0])
# print("p-value : ", result['p-val'][0])

T-test for Scenario 60 vs 58

Result Value
t-value -125.742
95% Confidence Interval [-3.61 -3.5 ]
p-Value 0.000
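
The tests above pass correction=True to pg.ttest. As far as the pingouin documentation describes it, this requests Welch's t-test for two independent samples, which does not assume equal variances. The sketch below shows what that statistic computes using only the standard library. The function name welch_t and the sample values are made up for illustration, and this is not pingouin's implementation.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return the Welch t statistic and its Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Squared standard error of each sample mean (sample variance divided by n)
    se2_x = variance(x) / nx
    se2_y = variance(y) / ny
    t_stat = (mean(x) - mean(y)) / math.sqrt(se2_x + se2_y)
    dof = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (nx - 1) + se2_y ** 2 / (ny - 1))
    return t_stat, dof

# Made-up annual temperatures for a warmer and a cooler scenario
high = [13.1, 13.4, 12.9, 13.6, 13.0]
low = [11.8, 12.1, 11.9, 12.3]

t_stat, dof = welch_t(high, low)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

The sign of t follows the order of the arguments: a negative value means the first sample has the lower mean. With samples as large as those in this dataset, p-values shrink toward zero even for modest differences, so the confidence interval of the difference is the more informative number.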

Conclusion

For our dataset analysis, we will compare the maximum and minimum scenario within each RCP group to analyze which features affect temperature the most. That is, we compare scenario 37 with scenario 40 for RCP 4.5, and scenario 58 with scenario 60 for RCP 8.5.

Next Steps!

Now that we’ve shown that the difference between the RCP scenarios, and the difference between the highest and lowest scenario within each RCP, are all statistically significant, let’s dive deeper into the dataset and construct visualizations to hypothesize which features are correlated with the predicted temperature!